{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "3766a2be",
   "metadata": {
    "origin_pos": 1
   },
   "source": [
    "# Data Preprocessing\n",
    ":label:`sec_pandas`\n",
    "\n",
    "So far, we have been working with synthetic data\n",
    "that arrived in ready-made tensors.\n",
    "However, to apply deep learning in the wild\n",
    "we must extract messy data \n",
    "stored in arbitrary formats,\n",
    "and preprocess it to suit our needs.\n",
    "Fortunately, the *pandas* [library](https://pandas.pydata.org/) \n",
    "can do much of the heavy lifting.\n",
    "This section, while no substitute \n",
    "for a proper *pandas* [tutorial](https://pandas.pydata.org/pandas-docs/stable/user_guide/10min.html),\n",
    "will give you a crash course\n",
    "on some of the most common routines.\n",
    "\n",
    "## Reading the Dataset\n",
    "\n",
    "Comma-separated values (CSV) files are ubiquitous \n",
    "for the storing of tabular (spreadsheet-like) data.\n",
    "In them, each line corresponds to one record\n",
    "and consists of several (comma-separated) fields, e.g.,\n",
    "\"Albert Einstein,March 14 1879,Ulm,Federal polytechnic school,field of gravitational physics\".\n",
    "To demonstrate how to load CSV files with `pandas`, \n",
    "we (**create a CSV file below**) `../data/house_tiny.csv`. \n",
    "This file represents a dataset of homes,\n",
    "where each row corresponds to a distinct home\n",
    "and the columns correspond to the number of rooms (`NumRooms`),\n",
    "the roof type (`RoofType`), and the price (`Price`).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 1,
   "id": "2bea3f1b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T19:31:20.494944Z",
     "iopub.status.busy": "2023-08-18T19:31:20.494290Z",
     "iopub.status.idle": "2023-08-18T19:31:20.505610Z",
     "shell.execute_reply": "2023-08-18T19:31:20.504677Z"
    },
    "origin_pos": 2,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [],
   "source": [
    "import os\n",
    "\n",
    "os.makedirs(os.path.join('..', 'data'), exist_ok=True)\n",
    "data_file = os.path.join('..', 'data', 'house_tiny.csv')\n",
    "with open(data_file, 'w') as f:\n",
    "    f.write('''NumRooms,RoofType,Price\n",
    "NA,NA,127500\n",
    "2,NA,106000\n",
    "4,Slate,178100\n",
    "NA,NA,140000''')"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a73772a6",
   "metadata": {
    "origin_pos": 3
   },
   "source": [
    "Now let's import `pandas` and load the dataset with `read_csv`.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 2,
   "id": "9ae201e1",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T19:31:20.510380Z",
     "iopub.status.busy": "2023-08-18T19:31:20.509849Z",
     "iopub.status.idle": "2023-08-18T19:31:21.105668Z",
     "shell.execute_reply": "2023-08-18T19:31:21.104596Z"
    },
    "origin_pos": 4,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   NumRooms RoofType   Price\n",
      "0       NaN      NaN  127500\n",
      "1       2.0      NaN  106000\n",
      "2       4.0    Slate  178100\n",
      "3       NaN      NaN  140000\n"
     ]
    }
   ],
   "source": [
    "import pandas as pd\n",
    "\n",
    "data = pd.read_csv(data_file)\n",
    "print(data)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2bcb4eec",
   "metadata": {
    "origin_pos": 5
   },
   "source": [
    "## Data Preparation\n",
    "\n",
    "In supervised learning, we train models\n",
    "to predict a designated *target* value,\n",
    "given some set of *input* values. \n",
    "Our first step in processing the dataset\n",
    "is to separate out columns corresponding\n",
    "to input versus target values. \n",
    "We can select columns either by name or\n",
    "via integer-location based indexing (`iloc`).\n",
    "\n",
    "You might have noticed that `pandas` replaced\n",
    "all CSV entries with value `NA`\n",
    "with a special `NaN` (*not a number*) value. \n",
    "This can also happen whenever an entry is empty,\n",
    "e.g., \"3,,,270000\".\n",
    "These are called *missing values* \n",
    "and they are the \"bed bugs\" of data science,\n",
    "a persistent menace that you will confront\n",
    "throughout your career. \n",
    "Depending upon the context, \n",
    "missing values might be handled\n",
    "either via *imputation* or *deletion*.\n",
    "Imputation replaces missing values \n",
    "with estimates of their values\n",
    "while deletion simply discards \n",
    "either those rows or those columns\n",
    "that contain missing values. \n",
    "\n",
    "Here are some common imputation heuristics.\n",
    "[**For categorical input fields, \n",
    "we can treat `NaN` as a category.**]\n",
    "Since the `RoofType` column takes values `Slate` and `NaN`,\n",
    "`pandas` can convert this column \n",
    "into two columns `RoofType_Slate` and `RoofType_nan`.\n",
    "A row whose roof type is `Slate` will set values \n",
    "of `RoofType_Slate` and `RoofType_nan` to 1 and 0, respectively.\n",
    "The converse holds for a row with a missing `RoofType` value.\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 3,
   "id": "f92e80b6",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T19:31:21.109879Z",
     "iopub.status.busy": "2023-08-18T19:31:21.109243Z",
     "iopub.status.idle": "2023-08-18T19:31:21.120081Z",
     "shell.execute_reply": "2023-08-18T19:31:21.119081Z"
    },
    "origin_pos": 6,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   NumRooms  RoofType_Slate  RoofType_nan\n",
      "0       NaN           False          True\n",
      "1       2.0           False          True\n",
      "2       4.0            True         False\n",
      "3       NaN           False          True\n"
     ]
    }
   ],
   "source": [
    "inputs, targets = data.iloc[:, 0:2], data.iloc[:, 2]\n",
    "inputs = pd.get_dummies(inputs, dummy_na=True)\n",
    "print(inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "fea48715",
   "metadata": {
    "origin_pos": 7
   },
   "source": [
    "For missing numerical values, \n",
    "one common heuristic is to \n",
    "[**replace the `NaN` entries with \n",
    "the mean value of the corresponding column**].\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 4,
   "id": "37e8e900",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T19:31:21.123941Z",
     "iopub.status.busy": "2023-08-18T19:31:21.123273Z",
     "iopub.status.idle": "2023-08-18T19:31:21.132513Z",
     "shell.execute_reply": "2023-08-18T19:31:21.131522Z"
    },
    "origin_pos": 8,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "   NumRooms  RoofType_Slate  RoofType_nan\n",
      "0       3.0           False          True\n",
      "1       2.0           False          True\n",
      "2       4.0            True         False\n",
      "3       3.0           False          True\n"
     ]
    }
   ],
   "source": [
    "inputs = inputs.fillna(inputs.mean())\n",
    "print(inputs)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6d535cf",
   "metadata": {
    "origin_pos": 9
   },
   "source": [
    "## Conversion to the Tensor Format\n",
    "\n",
    "Now that [**all the entries in `inputs` and `targets` are numerical,\n",
    "we can load them into a tensor**] (recall :numref:`sec_ndarray`).\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": 5,
   "id": "d211233b",
   "metadata": {
    "execution": {
     "iopub.execute_input": "2023-08-18T19:31:21.137043Z",
     "iopub.status.busy": "2023-08-18T19:31:21.136126Z",
     "iopub.status.idle": "2023-08-18T19:31:23.159251Z",
     "shell.execute_reply": "2023-08-18T19:31:23.158224Z"
    },
    "origin_pos": 11,
    "tab": [
     "pytorch"
    ]
   },
   "outputs": [
    {
     "data": {
      "text/plain": [
       "(tensor([[3., 0., 1.],\n",
       "         [2., 0., 1.],\n",
       "         [4., 1., 0.],\n",
       "         [3., 0., 1.]], dtype=torch.float64),\n",
       " tensor([127500., 106000., 178100., 140000.], dtype=torch.float64))"
      ]
     },
     "execution_count": 5,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "import torch\n",
    "\n",
    "X = torch.tensor(inputs.to_numpy(dtype=float))\n",
    "y = torch.tensor(targets.to_numpy(dtype=float))\n",
    "X, y"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "2a3027b6",
   "metadata": {
    "origin_pos": 14
   },
   "source": [
    "## Discussion\n",
    "\n",
    "You now know how to partition data columns, \n",
    "impute missing variables, \n",
    "and load `pandas` data into tensors. \n",
    "In :numref:`sec_kaggle_house`, you will\n",
    "pick up some more data processing skills. \n",
    "While this crash course kept things simple,\n",
    "data processing can get hairy.\n",
    "For example, rather than arriving in a single CSV file,\n",
    "our dataset might be spread across multiple files\n",
    "extracted from a relational database.\n",
    "For instance, in an e-commerce application,\n",
    "customer addresses might live in one table\n",
    "and purchase data in another.\n",
    "Moreover, practitioners face myriad data types\n",
    "beyond categorical and numeric, for example,\n",
    "text strings, images,\n",
    "audio data, and point clouds. \n",
    "Oftentimes, advanced tools and efficient algorithms \n",
    "are required in order to prevent data processing from becoming\n",
    "the biggest bottleneck in the machine learning pipeline. \n",
    "These problems will arise when we get to \n",
    "computer vision and natural language processing. \n",
    "Finally, we must pay attention to data quality.\n",
    "Real-world datasets are often plagued \n",
    "by outliers, faulty measurements from sensors, and recording errors, \n",
    "which must be addressed before \n",
    "feeding the data into any model. \n",
    "Data visualization tools such as [seaborn](https://seaborn.pydata.org/), \n",
    "[Bokeh](https://docs.bokeh.org/), or [matplotlib](https://matplotlib.org/)\n",
    "can help you to manually inspect the data \n",
    "and develop intuitions about \n",
    "the type of problems you may need to address.\n",
    "\n",
    "\n",
    "## Exercises\n",
    "\n",
    "1. Try loading datasets, e.g., Abalone from the [UCI Machine Learning Repository](https://archive.ics.uci.edu/ml/datasets.php) and inspect their properties. What fraction of them has missing values? What fraction of the variables is numerical, categorical, or text?\n",
    "1. Try indexing and selecting data columns by name rather than by column number. The pandas documentation on [indexing](https://pandas.pydata.org/pandas-docs/stable/user_guide/indexing.html) has further details on how to do this.\n",
    "1. How large a dataset do you think you could load this way? What might be the limitations? Hint: consider the time to read the data, representation, processing, and memory footprint. Try this out on your laptop. What happens if you try it out on a server? \n",
    "1. How would you deal with data that has a very large number of categories? What if the category labels are all unique? Should you include the latter?\n",
    "1. What alternatives to pandas can you think of? How about [loading NumPy tensors from a file](https://numpy.org/doc/stable/reference/generated/numpy.load.html)? Check out [Pillow](https://python-pillow.org/), the Python Imaging Library.\n"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "634223ee",
   "metadata": {
    "origin_pos": 16,
    "tab": [
     "pytorch"
    ]
   },
   "source": [
    "[Discussions](https://discuss.d2l.ai/t/29)\n"
   ]
  }
 ],
 "metadata": {
  "language_info": {
   "name": "python"
  },
  "required_libs": []
 },
 "nbformat": 4,
 "nbformat_minor": 5
}